Conversation

@karenfeng
Collaborator

What changes are proposed in this pull request?

Creates documentation for WGR.

How is this patch tested?

  • Unit tests
  • Integration tests
  • Manual tests

henrydavidge and others added 30 commits May 15, 2020 09:58
Add Leland's demo notebook
…or WGR (#2)

* blocks

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* test vcf

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* transformer

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* remove extra

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* refactor and conform with ridge namings

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* test

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* test files

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* remove extra file

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* sort_key

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>
* feat: ridge models for wgr added
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Doc strings added for levels/functions.py
Some typos fixed in ridge_model.py
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* ridge_model and RidgeReducer unit tests added
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* RidgeRegression unit tests added
test data README added
ridge_udfs.py docstrings added
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Changes made to accessing the sample ID map and more docstrings

The map_normal_eqn and score_models functions previously expected the
sample IDs for a given sample block to be found in the Pandas DataFrame,
which meant we had to join them on before the .groupBy().apply().  These
functions now expect the sample block to sample IDs mapping to be
provided separately as a dict, so that the join is no longer required.
RidgeReducer and RidgeRegression APIs remain unchanged.

docstrings have been added for RidgeReducer and RidgeRegression classes.

Signed-off-by: Leland Barnard (leland.barnard@gmail.com)
Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Refactored object names and comments to reflect new terminology

Where 'block' was previously used to refer to the set of columns in a
block, we now use 'header_block'
Where 'group' was previously used to refer to the set of samples in a
block, we now use 'sample_block'

Signed-off-by: Leland Barnard (leland.barnard@gmail.com)
Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>
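
As a rough illustration of the sample-block mapping described above, the sketch below passes the mapping as a plain dict instead of joining sample IDs onto each block. The function name `label_block_values`, the block IDs, and the sample IDs are hypothetical and are not taken from the Glow code.

```python
# Hypothetical mapping from a sample_block ID to the ordered sample IDs in that block.
sample_blocks = {
    "1": ["s1", "s2"],
    "2": ["s3", "s4"],
}


def label_block_values(values, sample_block, sample_blocks):
    """Pair each value in a block with its sample ID, looked up from the dict."""
    sample_ids = sample_blocks[sample_block]
    if len(sample_ids) != len(values):
        raise ValueError("block size does not match the number of sample IDs")
    return dict(zip(sample_ids, values))


labeled = label_block_values([0.0, 2.0], "2", sample_blocks)
```

Because the lookup is a dict access keyed by block ID, no join against the per-block data is needed in this sketch.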
* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* existing tests pass

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* rename file

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add compat test

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* scalafmt

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* collect minimal columns

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* address comments

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Test fixup

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Spark 3 needs more recent PyArrow, reduce mem consumption by removing unnecessary caching

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* PyArrow 0.15.1 only with PySpark 3

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Don't use toPandas()

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Upgrade pyarrow

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Only register once

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Minimize memory usage

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Select before head

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* set up/tear down

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Try limiting pyspark memory

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* No teardown

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Extend timeout

Signed-off-by: Karen Feng <karen.feng@databricks.com>
* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* existing tests pass

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* rename file

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add compat test

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* scalafmt

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* collect minimal columns

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* start changing for readability

* use input label ordering

* rename create_row_indexer

* undo column sort

* change reduce

Signed-off-by: Henry D <henrydavidge@gmail.com>

* further simplify reduce

* sorted alpha names

* remove ordering

* comments

Signed-off-by: Henry D <henrydavidge@gmail.com>

* Set arrow env var in build

Signed-off-by: Henry D <henrydavidge@gmail.com>

* faster sort

* add test file

* undo test data change

* >=

* formatting

* empty

Co-authored-by: Karen Feng <karen.feng@databricks.com>
* yapf

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* yapf transform

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Set driver memory

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Try changing spark mem

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* match java tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* whoops

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* remove driver memory flag

Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>
* cleanup

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* whoops

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* cleanup

Signed-off-by: Karen Feng <karen.feng@databricks.com>
* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* whoops

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* simplify tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* yapf

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* index map compat

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add docs

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add more tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* pass args as ints

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Don't roll our own splitter

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* rename sample_index to sample_blocks

Signed-off-by: Karen Feng <karen.feng@databricks.com>
* Add type-checking to APIs

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Check valid alphas

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* check 0 sig

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add to install_requires list

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* cleanup comments

Signed-off-by: Karen Feng <karen.feng@databricks.com>
* Added necessary modifications to accommodate covariates in model fitting.

The initial formulation of the WGR model assumed a form y ~ Xb; however, in general we would like to use a model of the form y ~ Ca + Xb, where C is some matrix of covariates that are separate from the genomic features X.  This PR makes numerous changes to accommodate the covariate matrix C.

Adding covariates required the following breaking changes to the APIs:
 * indexdf is now a required argument for RidgeReducer.transform() and RidgeRegression.transform():
   * RidgeReducer.transform(blockdf, labeldf, modeldf) -> RidgeReducer.transform(blockdf, labeldf, indexdf, modeldf)
   * RidgeRegression.transform(blockdf, labeldf, model, cvdf) -> RidgeRegression.transform(blockdf, labeldf, indexdf, model, cvdf)

Additionally, the function signatures for the fit and transform methods of RidgeReducer and RidgeRegression have all been updated to accommodate an optional covariate DataFrame as the final argument.

Two new tests have been added to test_ridge_regression.py to test run modes with covariates:
 * test_ridge_reducer_transform_with_cov
 * test_two_level_regression_with_cov

Signed-off-by: Leland Barnard (leland.barnard@gmail.com)
Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>
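
A minimal sketch of the y ~ Ca + Xb formulation, in plain Python. It assumes that the ridge penalty alpha is applied only to the genomic coefficients b and that the covariate coefficients a are left unpenalized. That choice, along with the names `solve` and `ridge_with_covariates`, is an assumption made for illustration and is not a description of the code in this PR.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.

    Singular systems are not handled in this sketch.
    """
    n = len(A)
    # Build the augmented matrix [A | b] so row operations apply to both.
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def ridge_with_covariates(C, X, y, alpha):
    """Fit y ~ Ca + Xb, penalizing only b (assumed, see the note above)."""
    # Stack covariates and genomic features column-wise: Z = [C | X].
    Z = [list(c_row) + list(x_row) for c_row, x_row in zip(C, X)]
    k = len(C[0])  # number of covariate columns
    p = len(Z[0])  # total number of columns
    ZtZ = [[sum(row[i] * row[j] for row in Z) for j in range(p)] for i in range(p)]
    Zty = [sum(row[i] * y_i for row, y_i in zip(Z, y)) for i in range(p)]
    # Add alpha to the diagonal entries of the genomic block only.
    for j in range(k, p):
        ZtZ[j][j] += alpha
    coef = solve(ZtZ, Zty)
    return coef[:k], coef[k:]


# Toy data: an intercept covariate and one feature, with y = 1 + 2x exactly.
C = [[1.0], [1.0], [1.0]]
X = [[0.0], [1.0], [2.0]]
y = [1.0, 3.0, 5.0]
a_unpenalized, b_unpenalized = ridge_with_covariates(C, X, y, alpha=0.0)
a_ridge, b_ridge = ridge_with_covariates(C, X, y, alpha=2.0)
```

With alpha set to 0 the fit recovers a = 1 and b = 2 on this toy data. With alpha set to 2 the feature coefficient shrinks to 1 and the unpenalized intercept moves to 2 to compensate.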

* Cleaned up one unnecessary Pandas import
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Small changes for clarity and consistency with the rest of the code.
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Forgot one usage of coalesce
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Added a couple of comments to explain logic and replaced usages of .values with .array
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Fixed one instance of the change .values -> .array where it was made in error.
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Typo in test_ridge_regression.py.
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

* Style auto-updates with yapfAll
Signed-off-by: Leland Barnard (leland.barnard@gmail.com)

Signed-off-by: Leland Barnard <leland.barnard@regeneron.com>

Co-authored-by: Leland Barnard <leland.barnard@regeneron.com>
Co-authored-by: Karen Feng <karen.feng@databricks.com>
* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Clean up tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Order to match labeldf

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Check we tie-break

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* cleanup

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* test var name

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* clean up tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Clean up docs

Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>
…wgr-docs

Signed-off-by: Karen Feng <karen.feng@databricks.com>
karenfeng and others added 11 commits June 22, 2020 08:06
* Rename levels to wgr

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* rename test files

Signed-off-by: Karen Feng <karen.feng@databricks.com>
* headers

* executable

* fix template rendering

* yapf
Signed-off-by: Karen Feng <karen.feng@databricks.com>
…wgr-docs

Signed-off-by: Karen Feng <karen.feng@databricks.com>
…-docs

Signed-off-by: Karen Feng <karen.feng@databricks.com>
@codecov
codecov bot commented Jun 22, 2020

Codecov Report

Merging #235 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #235   +/-   ##
=======================================
  Coverage   93.75%   93.75%           
=======================================
  Files          90       90           
  Lines        4339     4339           
  Branches      406      406           
=======================================
  Hits         4068     4068           
  Misses        271      271           

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9d3ad87...75ffd4c. Read the comment docs.

…-docs

Signed-off-by: Karen Feng <karen.feng@databricks.com>
Contributor

@williambrandler williambrandler left a comment

some comments and clarifications!


The genotype data may be read from any variant datasource supported by Glow, such as VCF, BGEN or PLINK. The DataFrame
must also include a column ``values`` containing a numeric representation of each genotype. The genotypic values may
not be missing, or equal for every sample in a variant.
Contributor

what does equal mean here? All homozygous reference?

Collaborator Author

Mathematically, we're trying to filter out variants for which all samples have the same calls, so that values has a variance/stddev of 0 (e.g., all hom-ref, all hom-alt, or even all het). I'm not sure what the best way to phrase this is.
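
To make the zero-variance condition concrete, here is a small plain-Python sketch. The helper name `is_usable_variant` and the use of `None` to mark a missing value are assumptions for illustration only; they are not part of Glow.

```python
from statistics import pstdev


def is_usable_variant(values):
    """Return True if no value is missing and the values are not all equal."""
    if not values or any(v is None for v in values):
        return False
    # A population standard deviation of 0 means every sample has the same value.
    return pstdev(values) > 0


variants = {
    "all_hom_ref": [0.0, 0.0, 0.0],
    "all_het": [1.0, 1.0, 1.0],
    "mixed": [0.0, 1.0, 2.0],
    "has_missing": [0.0, None, 2.0],
}
usable = [name for name, values in variants.items() if is_usable_variant(values)]
```

Only the variant with differing calls passes this check; the all-het variant is rejected just like the all-hom-ref one because its values are constant.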

- Split multiallelic variants with the ``split_multiallelics`` transformer.
- Calculate the number of alternate alleles for biallelic variants with ``glow.genotype_states``.
- Replace any missing values with the mean of the non-missing values using ``glow.mean_substitute``.
- Filter out all homozygous SNPs.
Contributor

Filter out all SNPs that contain zero non-reference alleles
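
The mean-substitution step in the list above can be illustrated with a plain-Python sketch. This is not Glow's ``glow.mean_substitute`` implementation; the function name `mean_substitute_sketch`, the use of -1 as the missing-value marker, and the choice to leave an all-missing input unchanged are assumptions made for illustration.

```python
def mean_substitute_sketch(values, missing_value=-1):
    """Replace entries equal to missing_value with the mean of the other entries."""
    present = [v for v in values if v != missing_value]
    if not present:
        # Nothing to average over; return a copy unchanged (assumed behavior).
        return list(values)
    mean = sum(present) / len(present)
    return [mean if v == missing_value else v for v in values]


# Alternate-allele counts for four samples; the second call is missing.
substituted = mean_substitute_sketch([0, -1, 2, 1])
```

Here the missing entry is replaced by 1.0, the mean of the three observed counts 0, 2, and 1.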

The fields in the model DataFrame are:

- ``header_block``: An ID assigned to the block x0 corresponding to the coefficients in this row.
- ``sample_block``: An ID assigned to the block x0 corresponding to the coefficients in this row.
Contributor

header_block and sample_block have the same description?

…-docs

Signed-off-by: Karen Feng <karen.feng@databricks.com>
Contributor

@williambrandler williambrandler left a comment

Is it worth having a comment up front that GlowGR only supports quantitative phenotypes for now, and we plan to implement binary traits in the near future?

Otherwise LGTM

@karenfeng
Collaborator Author

> Is it worth having a comment up front that GlowGR only supports quantitative phenotypes for now, and we plan to implement binary traits in the near future?
>
> Otherwise LGTM

I added a note that this only supports quantitative phenotypes. I'm going to avoid making promises in our docs.

Signed-off-by: Karen Feng <karen.feng@databricks.com>
Contributor

@henrydavidge henrydavidge left a comment

Looks awesome! Thanks @karenfeng !

@karenfeng karenfeng merged commit e0680a7 into master Jun 23, 2020
@henrydavidge henrydavidge deleted the wgr-docs branch August 5, 2020 20:58
6 participants